Advanced Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics.
I am an RStudio (Posit) certified tidyverse instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
When there are many fancy things to try out!!
Therefore, we discuss linear models beyond the least squares estimate.
Prediction Accuracy
Model Interpretability
“This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.”
A least squares model is fit for every possible combination of the \(p\) predictors.
So, if there are 3 predictors (\(p_1, p_2, p_3\)), we fit the following \(2^3 - 1 = 7\) models:
model 1 \(y = \beta_a + \beta_1*p_1\)
model 2 \(y = \beta_b + \beta_2*p_2\)
model 3 \(y = \beta_c + \beta_3*p_3\)
.
.
model 7 \(y = \beta_0 + \beta_e*p_1 + \beta_f*p_2 + \beta_g*p_3\)
Select the best from these
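The enumeration above can be sketched in base R. As an illustration (not part of the slides), `mtcars` stands in for the data, with `mpg` as the response and `wt`, `hp`, `disp` playing the roles of \(p_1, p_2, p_3\):

```r
# All non-empty subsets of 3 predictors: choose(3,1) + choose(3,2) + choose(3,3) = 7
predictors <- c("wt", "hp", "disp")

subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit a least squares model for each subset of predictors
models <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "mpg"), data = mtcars)
})

length(models)                                  # 7 models, as on the slide
sapply(models, function(m) summary(m)$r.squared)
```

With \(p\) predictors this grows to \(2^p - 1\) models, which is why best subset selection becomes expensive quickly.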
A null model is defined with no predictors. Name it \(M_0\)
For each \(k\), where \(k = 1, 2, 3, \ldots, p\), select the best model from all \(\binom{p}{k}\) combinations, using RSS or \(R^2\). Call it \(M_k\).
Select a single best model from \(M_0, M_1, \ldots, M_p\). Use prediction error on a validation set, \(C_p\), AIC, BIC, adjusted \(R^2\), or the cross-validation method.
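In practice these steps are automated by `regsubsets()` from the `leaps` package. A minimal sketch, assuming `leaps` is installed and again using `mtcars` purely for illustration:

```r
# Best subset selection with leaps::regsubsets()
library(leaps)

fit <- regsubsets(mpg ~ wt + hp + disp + qsec, data = mtcars, nvmax = 4)
fit_summary <- summary(fit)

# Step 2: best model M_k of each size k, chosen by RSS
fit_summary$outmat

# Step 3: pick one final model, e.g. by adjusted R^2 or BIC
which.max(fit_summary$adjr2)
which.min(fit_summary$bic)
```

`nvmax` caps the largest subset size considered; here it equals the number of predictors, so the full model is included.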
Issues:
We generate several models using best subset selection, forward stepwise, and backward stepwise selection.
We must choose among them carefully; ideally, we want the model with the lowest test error rate.
\(C_p\), AIC,BIC, Adjusted \(R^2\)
Recall that \(MSE = RSS/n\). Since the least squares approach chooses coefficients so as to minimize the training RSS, the training MSE is an underestimate of the test MSE.
That is the reason \(R^2\) and training RSS are not suitable for selecting from many models.
Therefore we learn about \(C_p\), Akaike Information criterion (AIC), Bayesian Information Criterion (BIC) and Adjusted \(R^2\)
Say there is a least squares model with \(d\) predictors.
Its \(C_p\) estimate of the test error is given by:
\[C_p = \frac{1}{n}(RSS + 2d\hat\sigma^2)\]
\(\hat\sigma^2\) is the estimate of variance of \(\epsilon\), estimated using a full model.
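The formula can be computed directly in R. A sketch, using `mtcars` as an illustrative data set (the choice of predictors is arbitrary):

```r
# Estimate sigma^2 from the full model, as the slide notes
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
sigma2_hat <- summary(full)$sigma^2      # estimate of Var(epsilon)

# C_p for a candidate model with d = 2 predictors
sub <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
d   <- 2
rss <- sum(residuals(sub)^2)

cp <- (rss + 2 * d * sigma2_hat) / n
cp
```

The \(2d\hat\sigma^2\) term is the penalty: adding predictors always lowers RSS, but raises the penalty, so \(C_p\) only falls when the fit improves enough to justify the extra predictor.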
AIC is defined for a wide range of models fit by maximum likelihood:
\[AIC = 2d - 2\ln(L)\]
For a least squares model with \(d\) predictors, this is equivalent (up to constants) to:
\[AIC = \frac{1}{n}(RSS + 2d\hat\sigma^2)\]
BIC, for a least square model with d predictors:
\[BIC = \frac{1}{n}(RSS + \log(n)\,d\hat\sigma^2)\]
\[Adjusted\hspace{1mm} R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\]
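R computes all of these criteria for fitted `lm` objects. Note that R's built-in `AIC()` and `BIC()` use the full log-likelihood form, so the values differ from the least-squares formulas above by constants, but the *ranking* of models agrees. An illustrative comparison on `mtcars`:

```r
# Two nested candidate models
m1 <- lm(mpg ~ wt,      data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(m1); AIC(m2)              # lower is better
BIC(m1); BIC(m2)              # BIC penalizes extra predictors more heavily
summary(m1)$adj.r.squared     # higher is better
summary(m2)$adj.r.squared
```

Here all three criteria should prefer `m2`, since `hp` improves the fit by more than the penalty for one extra predictor.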
See 6.5.1 in ISLR
Apply the best subset approach to the Auto data in the ISLR2 package.
Use mpg as response.
What is the best model? Feel free to use ggplot for charts and broom to collect stats from model object.
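A starting sketch for the exercise, assuming the `ISLR2` and `leaps` packages are installed (`broom` and `ggplot2` are optional extras for tidying and plotting):

```r
library(ISLR2)
library(leaps)

# Drop the character column `name` before fitting
auto <- Auto[, !(names(Auto) %in% "name")]

best <- regsubsets(mpg ~ ., data = auto, nvmax = ncol(auto) - 1)
bs   <- summary(best)

# Criteria for the best model of each size
data.frame(size  = seq_along(bs$adjr2),
           adjr2 = bs$adjr2,
           cp    = bs$cp,
           bic   = bs$bic)
```

From here, plotting each criterion against model size makes the best choice easy to spot.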
See 6.5.1 in ISLR
Apply the validation set approach to the Auto data in the ISLR2 package.
Use mpg as response.
Which is the best model?
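A starting sketch of the validation split, assuming the `ISLR2` package is installed; the candidate model shown is just one example:

```r
library(ISLR2)

set.seed(1)
auto  <- Auto[, !(names(Auto) %in% "name")]
train <- sample(c(TRUE, FALSE), nrow(auto), replace = TRUE)

# Fit on the training half only
fit  <- lm(mpg ~ horsepower + weight, data = auto[train, ])

# Estimate the test error on the held-out half
pred    <- predict(fit, newdata = auto[!train, ])
val_mse <- mean((auto$mpg[!train] - pred)^2)
val_mse   # validation MSE; repeat for each candidate model
```

Repeating this for each candidate and picking the lowest validation MSE implements the selection step.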
In subset selection methods, we are still working with least squares.
Alternatively, we can shrink the coefficient estimates in the full model toward zero. This is referred to as shrinkage, constraining, or regularization.
This reduces the variance of the coefficient estimates, leading to a better fit.
Ridge and Lasso are two techniques that allow us to do this.
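Both are available through the `glmnet` package, where `alpha` selects the penalty. A sketch, assuming `glmnet` is installed and using `mtcars` purely as an illustration:

```r
library(glmnet)

# glmnet takes a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ . - 1, data = mtcars)
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)   # ridge: shrinks coefficients toward zero
lasso <- glmnet(x, y, alpha = 1)   # lasso: can set coefficients exactly to zero

# Cross-validation chooses the penalty strength lambda
cv <- cv.glmnet(x, y, alpha = 1)
coef(cv, s = "lambda.min")
```

This is the subject of the next topic; the key practical difference is that the lasso performs variable selection while ridge keeps all predictors.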